t--- title: "HW06" output: github_document ---
In cancer, mutations arise that promote growth or survival of cells. In glioblastoma multiforme and other high grade gliomas, a common mutation is a mutation of the 27th lysine (K) to a methionine (M) of the histone subunit H3, or in short H3K27M.
H3K27M is the most frequent oncohistone in brain cancers, but the biology is still not well understood. Your analysis is to look at the expression of several (27) genes to see if they are differentially expressed and plot 27 boxplots each gene. The data used in this analysis was obtained from this publication
Steps:
From the RNA-Seq files, you only need the "Name" from one file and the "TPM" column from all the files. TPM stands for "transcripts per million" and is a common unit for normalized expression data.
library(tidyverse)
library(readr)
library(stringr)
library(knitr)
#hint, using apply (specifically sapply) you can read in the data into a list and then bind the columns together. Or you can use a for loop too.
#you only need the
# Create a function to read the specific columns
read_col <- function(data){
file <- read_tsv(data)
return(file[["TPM"]])
}
# Get all the file directories
RNA_files <- c(str_c("RNA_Seq_processed/H3K27M/",list.files("RNA_Seq_processed/H3K27M")), str_c("RNA_Seq_processed/WT/",list.files("RNA_Seq_processed/WT")))
# Read and combine the "TPM" column of all the files
GBM.transcripts <- data.frame(sapply(RNA_files,read_col))
# Get the "name" column
file1 <- read_tsv(RNA_files[[1]])
# Combine the "name" column and the "TPM" columns
GBM.transcripts$gene_id <- file1$Name
Now, install the packages commented below (if needed), and then use this code to map the transcript IDs to gene symbols. To use this code, you need a dataframe called GBM.transcripts that has the first column "gene_id" that contains the transcript ids (e.g. ENST00000456328.2) and the remaining columns contain the TPM data. So long as the first column contains the "gene_id" column as mentioned above, this should run.
#install.packages("BiocManager")
#BiocManager::install("ensembldb")
#BiocManager::install("EnsDb.Hsapiens.v75")
library(ensembldb)
library(EnsDb.Hsapiens.v75)
ens.GBM.transcripts <- GBM.transcripts %>%
mutate(gene_id = gsub(pattern = "\\..*", "", .$gene_id))
map <- ensembldb::select(EnsDb.Hsapiens.v75, keys = ens.GBM.transcripts$gene_id,
keytype = "TXID", columns = c("SYMBOL", "TXID"))
ens.mapped_GBM <- left_join(ens.GBM.transcripts, map, by = c("gene_id" = "TXID")) %>%
dplyr::select(-1) %>%
dplyr::select(gene_symbol = SYMBOL, everything())
ens.mapped_GBM <- ens.mapped_GBM[!duplicated(ens.mapped_GBM$gene_symbol),] #remove duplicated gene symbols
#these are removed instead of averaged because they simply do not correlate particularly well.
ens.mapped_GBM <- ens.mapped_GBM[!is.na(ens.mapped_GBM$gene_symbol),] #remove NA values
Do the t-test and make a table of the t-test results!
#run this code to unload the libraries from before, it might be helpful because the select() function from dplyr might be hidden otherwise
detach(package:EnsDb.Hsapiens.v75, unload = T)
detach(package:ensembldb, unload = T)
#add in your own gene of interest!!!
genes_of_interest <- c("OR4G4P", "IRX1", "OSR1", "DCHS2", "BRINP3", "TOB2P1", "FOXD1", "ZFPM2", "GLB1", "ALG5", "TRIM4", "ADARB2", "PCDHGA11", "IDH1", "EGFR", "MGMT", "TERT", "PTEN", "TP53", "RB1", "ATRX", "PDGFRA", "PIK3CA", "MICA", "CDKN2A", "EZH2", "BRD2")
# I found there are 2 "PTEN", so I deleted one. Hope that is okay.
GBM.genes.of.interest <- filter(ens.mapped_GBM, gene_symbol %in% genes_of_interest)
#Now perform a t-test between the H3K mutated and the wt samples. There are many ways to do this actually, you can use a for loop or you could do the tidy alternative with broom(), but the for loop is probably the easiest
H3K27M <- GBM.genes.of.interest %>%
dplyr::select(contains(".H3K27M."), gene_symbol) %>%
gather(key = "Sample", value = "Expression27", -gene_symbol) %>%
mutate(type = "H3K27M")
WT <- GBM.genes.of.interest %>%
dplyr::select(contains(".WT."), gene_symbol) %>%
gather(key = "Sample", value = "Expression27", -gene_symbol) %>%
mutate(type = "WT")
GBM.genes.of.interest2 <- rbind(H3K27M, WT)
t_results <- list()
for (i in genes_of_interest){
gene_temp <- GBM.genes.of.interest2[GBM.genes.of.interest2$gene_symbol == i,]
t_results[[i]] <- t.test(gene_temp$Expression27[gene_temp$type == "H3K27M"],gene_temp$Expression27[gene_temp$type == "WT"])
}
# Make a table to store the results
t_results_sum <- data.frame(genes_of_interest,sapply(t_results, function(l) l[["statistic"]]), sapply(t_results, function(l) l[["p.value"]]))
colnames(t_results_sum)[2:3] <- c("t_statistic", "p_value")
rownames(t_results_sum) <- NULL
#print out the t-test results
kable(t_results_sum)
| genes_of_interest | t_statistic | p_value |
|---|---|---|
| OR4G4P | -0.7522158 | 0.4589634 |
| IRX1 | 5.1336629 | 0.0000099 |
| OSR1 | 5.6829774 | 0.0000034 |
| DCHS2 | 5.7324096 | 0.0000046 |
| BRINP3 | 5.1194522 | 0.0000159 |
| TOB2P1 | -4.3860210 | 0.0001389 |
| FOXD1 | -4.4889072 | 0.0001679 |
| ZFPM2 | 4.5714175 | 0.0001077 |
| GLB1 | -3.9100261 | 0.0004298 |
| ALG5 | -4.4610902 | 0.0000905 |
| TRIM4 | -3.7724880 | 0.0005934 |
| ADARB2 | 6.2938237 | 0.0000010 |
| PCDHGA11 | -1.1719705 | 0.2541779 |
| IDH1 | -1.5486643 | 0.1313748 |
| EGFR | -1.4214767 | 0.1691524 |
| MGMT | 0.9184095 | 0.3636800 |
| TERT | -1.2419552 | 0.2262129 |
| PTEN | -2.2353246 | 0.0315199 |
| TP53 | -0.0400511 | 0.9682425 |
| RB1 | -1.6058411 | 0.1208662 |
| ATRX | -0.9777387 | 0.3343672 |
| PDGFRA | -0.2277656 | 0.8215680 |
| PIK3CA | -0.2722628 | 0.7867548 |
| MICA | -2.3374256 | 0.0293613 |
| CDKN2A | -2.1599041 | 0.0423925 |
| EZH2 | 0.0030369 | 0.9975928 |
| BRD2 | 1.4820614 | 0.1472154 |
Now create a graphing function to create boxplots to visualize the results. Plot expression on the y-axis. The graph should look like this example
#to work in the tidyverse, it will be easier to make tidy the dataframe first
GBM.genes.of.interests_split <- GBM.genes.of.interest2 %>%
group_split(gene_symbol)
#create a graphing function
my_plot <- function(data){
p_temp <- ggplot(data, aes(x = type, y = Expression27, fill = type)) +
geom_boxplot() +
labs(title = paste0(data$gene_symbol[1]," Expression in GBM models \n by H3K27 Mutated or WT Status"), x = "H3K27", y = "Expression_(TPM)") +
theme(plot.title = element_text(hjust = 0.5),legend.position = "none") +
scale_x_discrete(labels = c('H3K27_Mutated','WT'))
print(p_temp)
}
#then use a for loop combined with the graphing function to make a graph for all your genes of interest
lapply(GBM.genes.of.interests_split, my_plot)
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
##
## [[9]]
##
## [[10]]
##
## [[11]]
##
## [[12]]
##
## [[13]]
##
## [[14]]
##
## [[15]]
##
## [[16]]
##
## [[17]]
##
## [[18]]
##
## [[19]]
##
## [[20]]
##
## [[21]]
##
## [[22]]
##
## [[23]]
##
## [[24]]
##
## [[25]]
##
## [[26]]
##
## [[27]]
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] AnnotationFilter_1.12.0 GenomicFeatures_1.40.1 AnnotationDbi_1.50.3
## [4] Biobase_2.48.0 GenomicRanges_1.40.0 GenomeInfoDb_1.24.2
## [7] IRanges_2.22.2 S4Vectors_0.26.1 BiocGenerics_0.34.0
## [10] knitr_1.29 forcats_0.5.0 stringr_1.4.0
## [13] dplyr_1.0.1 purrr_0.3.4 readr_1.3.1
## [16] tidyr_1.1.1 tibble_3.0.3 ggplot2_3.3.2
## [19] tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] ProtGenerics_1.20.0 matrixStats_0.56.0
## [3] bitops_1.0-6 fs_1.5.0
## [5] lubridate_1.7.9 bit64_4.0.2
## [7] progress_1.2.2 httr_1.4.2
## [9] tools_4.0.2 backports_1.1.8
## [11] R6_2.4.1 lazyeval_0.2.2
## [13] DBI_1.1.0 colorspace_1.4-1
## [15] withr_2.2.0 tidyselect_1.1.0
## [17] prettyunits_1.1.1 bit_4.0.4
## [19] curl_4.3 compiler_4.0.2
## [21] cli_2.0.2 rvest_0.3.6
## [23] xml2_1.3.2 DelayedArray_0.14.1
## [25] labeling_0.3 rtracklayer_1.48.0
## [27] scales_1.1.1 askpass_1.1
## [29] rappdirs_0.3.1 digest_0.6.25
## [31] Rsamtools_2.4.0 rmarkdown_2.3
## [33] XVector_0.28.0 pkgconfig_2.0.3
## [35] htmltools_0.5.0 highr_0.8
## [37] dbplyr_1.4.4 rlang_0.4.7
## [39] readxl_1.3.1 rstudioapi_0.11
## [41] RSQLite_2.2.0 farver_2.0.3
## [43] generics_0.0.2 jsonlite_1.7.0
## [45] BiocParallel_1.22.0 RCurl_1.98-1.2
## [47] magrittr_1.5 GenomeInfoDbData_1.2.3
## [49] Matrix_1.2-18 Rcpp_1.0.5
## [51] munsell_0.5.0 fansi_0.4.1
## [53] lifecycle_0.2.0 stringi_1.4.6
## [55] SummarizedExperiment_1.18.2 zlibbioc_1.34.0
## [57] BiocFileCache_1.12.1 grid_4.0.2
## [59] blob_1.2.1 crayon_1.3.4
## [61] lattice_0.20-41 Biostrings_2.56.0
## [63] haven_2.3.1 hms_0.5.3
## [65] pillar_1.4.6 biomaRt_2.44.1
## [67] reprex_0.3.0 XML_3.99-0.5
## [69] glue_1.4.1 evaluate_0.14
## [71] modelr_0.1.8 vctrs_0.3.2
## [73] cellranger_1.1.0 gtable_0.3.0
## [75] openssl_1.4.2 assertthat_0.2.1
## [77] xfun_0.16 broom_0.7.0
## [79] GenomicAlignments_1.24.0 memoise_1.1.0
## [81] ellipsis_0.3.1